Crash severity analysis and risk factors identification based on an alternate data source: a case study of developing country

Road traffic injuries are one of the primary causes of death, especially in developing countries like Bangladesh. Safety in land transport is one of the major concerns for road safety authorities and other policymakers. For this reason, identifying the contributory factors associated with crashes is necessary for reducing road crashes and ensuring transportation safety. This paper presents an analytical approach to identifying significant contributing factors of Bangladesh road crashes by evaluating the road crash data, considering three different severity levels (non-fatal, severe, and extremely severe). Generally, official crash databases are compiled from police-reported crash records. Though the official datasets focus on compiling a wide array of attributes, a number of issues related to unreported crashes can be observed, which calls for an alternative source of crash data. Therefore, the proposed approach compiles crash data from newspapers in Bangladesh, which could be complementary to the official crash database. To conduct the analysis, first, we filtered the useful features from the compiled crash data using three popular feature selection techniques: chi-square, Two-way ANOVA, and Regression analysis. Then, we employed three machine learning classifiers: Decision Tree, Random Forest, and Naïve Bayes over the extracted features. A confusion matrix was considered to evaluate the proposed model, including classification accuracy, sensitivity, and specificity. The predictive machine learning model, namely, Random Forest using Label Encoder with the chi-square and Two-way ANOVA feature selection process, seems the best option for crash severity prediction, as it provides high prediction accuracy. The resulting model highlights nine out of fourteen independent features as responsible factors. 
Significant features associated with crash severities include driver characteristics (gender, license type, seat belts), vehicle characteristics (vehicle type), road characteristics (road surface type, road classification), environmental conditions (day of crash occurred, time of crash), and injury localization. This outcome may contribute to improving traffic safety of Bangladesh.


Background study
Predicting crash severity and identifying the responsible factors are two significant issues of traffic safety research. Therefore, various approaches have been developed and implemented for crash severity prediction and identification of influential factors. A summary of the most recent related research on crash severity analysis and prediction is provided in Table 1. The information presented in Table 1 includes the research method, severity level, features, performance metrics, and prediction results. A detailed review of all the relevant studies is beyond the scope of this study. In summarizing the previous literature results, we have mostly focused on safety studies employing machine learning-based approaches while also considering a few studies from traditional statistical approaches for comparison purposes.
Researchers have employed several approaches to identifying the responsible factors, including econometric models, machine learning, and data mining frameworks. Applying traditional statistical and econometric models remains a workhorse in existing safety literature to identify the relevant features of crash risk and crash severity. Specifically, researchers have employed multinomial logit/probit, ordered logit/probit, count regression techniques, and generalized forms as random parameters and models for systematic heterogeneity aspects [10][11][12] .
With the emergence of advanced computing power, safety researchers have recently focused on the applications of machine learning approaches as an alternative analytical approach to modelling crash risk and severity. Several approaches were adopted using Artificial Neural Network 13 , Support Vector Machine 14 , and Logistic Regression 15 . Previous studies employed Decision Tree algorithms to analyze crash severity 10,13,14 . Advanced versions of Decision Tree such as Random Forest 16 , C4.5 algorithm 17 , Classification and Regression Tree 9 , and Multivariate Adaptive Regression Splines 1 have also been identified to be adopted by several researchers.
Nevertheless, some past studies revealed that machine learning models have limitations in observing the correlation between the input and responsible variables. These studies also pointed out that limited attention to the feature selection process resulted in poor accuracy in crash severity prediction with machine learning algorithms. Therefore, Pillajo-Quijia et al. 9 emphasized the importance of feature selection and concluded that feature selection might improve the accuracy of crash severity prediction with machine learning algorithms. Rezapour et al. 17 added that feature elimination could improve the accuracy of several machine learning algorithms, including Random Forest and Support Vector Machine. Inspired by these studies, Ghandour et al. 18 implemented a feature selection technique using chi-square, which has significant importance in improving the accuracy of machine learning classifiers.
Several past studies implemented Logistic Regression [16][17][18][19][20][21] , Random Forest 8,9,16,[19][20][21] , Classification Tree 1 and C4.5 algorithm 21 for crash severity prediction in different severity levels (i.e., serious or fatal/accident or nonaccident/ possible injury or property damage, no-injury or minor injury or severe injury). These studies identified several influential factors, including driver characteristics (i.e., gender, age, residency, speed limit compliance, driver conditions); environmental characteristics (i.e., weather conditions, road conditions, and lighting conditions); roadway characteristics (i.e., roadway surface condition, crash location include vertical and horizontal characteristics of the segment and the posted speed limit of the segment). Fiorentini and Losa 19 showed that the Logistic Regression model performs better in predicting property damage and fatal crashes with 85.74% accuracy. Similarly, Mafi et al. 21 added that the Random Forest model performs well in no-injury, minor injury, or severe injury prediction with 87.15% accuracy. Rezapour et al. 1 and Mafi et al. 21 implemented the Classification Tree and C4.5 model and extracted 70% and 75.7% accuracy, respectively 17 .
Some studies adopted other machine learning models, namely Support Vector Machine 1,9,13,22 , k-nearest neighbor 19,23 , and Naïve Bayes 18,24 to analyze a wide range of crash factors such as vehicle characteristics (i.e., vehicle age, condition of vehicle, weight of the vehicle, group of light trucks and vans, occupants involved, the body type of the vehicle), road infrastructure (road function, lane width, shoulder type, accident location), environmental condition (i.e., sight distance, lighting condition, weather, crash time, month/season) and temporal characteristics (i.e., crash type, time of day, day of the week, annual average daily traffic, type of separator, roadway terrain, left shoulder width, and right shoulder width, number of vehicles). Delen et al. 13 revealed that Support Vector Machine performs better in low or high injury classification and prediction with 90.41% accuracy. Fiorentini and Losa 19 added that the k-nearest neighbors algorithm performs well in identifying fatal injury and property damage with 78.53% prediction accuracy.
Several studies investigated the impacts of several factors by using XGBoost, Artificial Neural Network/Feed-forward Neural Networks/Multi-layer perceptron, and the Mixed Logit model to predict crash severity and injury severity. Guo et al. 25 suggested that the XGBoost algorithm achieves higher prediction performance than the Artificial Neural Network/Feed-forward Neural Networks algorithm, with 80.35% accuracy. Uddin and Huynh 26 revealed that the Mixed Logit model performs better than other machine learning models with 99% accuracy in crash analysis and prediction. Wahab and Jiang 14 and Assi et al. 15 concluded that the multi-layer perceptron model performs better than the feed-forward neural networks model with 72.16% and 71.8% accuracy, respectively.
Additionally, few researchers employed Classification and Regression Tree, Bayesian additive regression trees, and the Simple Cart model to investigate additional responsible factors. Pillajo-Quijia et al. 11 and Mondal et al. 10 developed a model based on the Classification and Regression Trees and Bayesian Additive Regression Trees. They predicted crash severity based on weekdays, months, type of road surface construction, and lane width with 78% and 61% accuracy, respectively. Wahab and Jiang 14 developed a model based on the Projective Adaptive Resonance Theory and Simple Cart algorithm model. The study found that Simple Cart performs better than other statistical models. While the previous studies performed well with regard to traditional machine learning models, few of these studies focused on the feature selection process in crash severity analysis and prediction 1,26 .
Hence the significant outcome of the discussed works is noticeable. It is noteworthy that several past studies have already argued that the feature selection process improves the classification and prediction performance of models in the context of crash severity analysis. Without a feature selection step, the model may lead to poor results with inappropriate predictions and result in inappropriate countermeasures specific to the crash-responsible factors. Following this, a few studies implemented several feature selection techniques to improve the crash severity analysis in terms of classification and prediction results 1,11,17,18 . However, these studies relied on a single feature selection technique/algorithm, while considering multiple feature selection techniques might alter the final result. Moreover, the majority of the existing studies consider the official databases that are freely accessible to the research community, though a number of issues associated with unreported data are observed. Therefore, an alternative to the official database is an emerging need. Given these drawbacks, the significance of the proposed approach lies in analyzing crash data compiled from newspapers as an alternative to official data, using multiple feature selection techniques and machine learning algorithms to represent the crash severity statistics and identify the associated risk factors that are responsible for serious crashes in developing countries like Bangladesh.

Study design
Data gathering. The major focus of this study is to develop and propose a data-scraping algorithm for developing a compiled crash database from different newspaper articles. In this study, we have compiled crash records reported in several newspapers of Bangladesh for the year 2019. We opted for three newspapers: Daily Prothom Alo (printed and e-paper), Daily Jugantor (printed and e-paper), and Bdnews24 (e-paper/online portal) according to their popularity, daily circulation number, and readers on e-portal. As of May 2021, the Daily Prothom Alo and the Daily Jugantor are the oldest newspapers in Bangladesh, and their circulation numbers are 501,800 and 290,200, respectively. Besides, according to the statistics of July 2015, the number of readers on the Bdnews24 e-portal is 6.6 million. Therefore, these three newspapers are the most circulated and popular newspapers covering a wide range of news, including crash reports, which led us to consider them for compiling the crash data in this study. The proposed approach can further be augmented by compiling crash information from other newspapers. The crash data compilation algorithm and process from the daily newspapers are presented in Fig. 1 and are discussed in the following sections.
Tier 1-data scraping and database creation. To create a Bangladesh road crash database, we first searched several potential sources of data where the crash reports were published. We selected three top newspapers to identify the road crash context data. For that, we created a context keyword set for road crashes in which different words, such as road accident and road crash, were stored to match the news keywords. After computing the semantic similarity (WordNet similarity), a news article was included in the road crash database if the similarity value was greater than the threshold value of 0.7.
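The Tier 1 inclusion rule can be illustrated with a minimal sketch. It assumes a pluggable similarity function: the study uses WordNet similarity, whereas `toy_similarity` below is a hypothetical stand-in (Jaccard overlap of word tokens) chosen only so that the example runs without external resources. The function and constant names are illustrative and do not come from the study.

```python
from typing import Callable, Iterable

# Context keyword set for road crashes (the two examples named in the text).
CONTEXT_KEYWORDS = ("road accident", "road crash")
THRESHOLD = 0.7  # similarity threshold stated in the text


def toy_similarity(a: str, b: str) -> float:
    """Hypothetical stand-in for WordNet similarity: Jaccard overlap of tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def is_crash_article(
    news_keywords: Iterable[str],
    similarity: Callable[[str, str], float] = toy_similarity,
    context_keywords: Iterable[str] = CONTEXT_KEYWORDS,
    threshold: float = THRESHOLD,
) -> bool:
    """Include the article if any news keyword exceeds the similarity threshold."""
    context = list(context_keywords)
    return any(
        similarity(word, ctx) > threshold
        for word in news_keywords
        for ctx in context
    )


if __name__ == "__main__":
    print(is_crash_article(["bus", "road crash", "highway"]))
    print(is_crash_article(["cricket match", "stadium"]))
```

Passing the similarity measure as a parameter keeps the threshold logic separate from the choice of measure, so a WordNet-based function could be substituted without changing the filter.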
Tier 2-keyword extraction and matching. Data preprocessing was an important part of keyword extraction, used to omit words like 'the', 'as', and 'are'. After completing the preprocessing steps (data cleaning, integration, transformation, and reduction), we extracted keywords from the preprocessed text, such as "occurred" and "accident". In the meantime, a road crash information context basket was created to find the semantic similarity between the extracted keywords and the basket words. For instance, the driver's information basket contains the driver's age and gender information. The vehicle information basket includes vehicle age and type-related information. If the WordNet similarity value of a keyword was greater than the threshold value, the keyword was used in tier 3. In parallel, we evaluated the severity level by using a context keyword set of severity levels.
Tier 3-features extraction for final dataset. The features were selected based on the output of tier 2. We used the crash type and responsible factors of crash severity (Fig. 1) as the key features to create the final dataset for the proposed model. Based on the three major newspapers, the newspaper-archived crash database collected for the year 2019 across Bangladesh contains 441 crash records and also reports information relevant to vehicle characteristics, environmental characteristics, driver characteristics, road characteristics, and residential location characteristics. The identified features are summarized in Table 2. The crash severity was reported in the database as a three-point severity scale variable: non-fatal injury, severe injury, and extremely severe injury. From Table 2, it can be observed that, among the 441 records, 2.26% of crashes were non-fatal, 73.92% were severe, and 23.80% were extremely severe injuries.
Data pre-processing. Data preprocessing is an essential element of data analysis to assure the quality and reliability of the result. Therefore, we performed some data preprocessing before implementing the model. The preprocessing was performed by removing duplicate records and eliminating unexpected notations. After completing the preprocessing, fifteen features were initially selected for further analysis. Fourteen features were considered as input attributes, and crash severity outcome was considered as the output of this study. Crash severity is defined as three-point variables: non-fatal crash (if there were zero fatalities), severe crash (if one or two crash victims were fatally injured), and extremely severe crash (if three or more than three crash victims were fatally injured). Table 2 shows the input and output features with their detailed description.
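The three-point severity definition above maps directly onto a small labelling function. The following is a minimal sketch; the function name and the string labels are illustrative choices, not identifiers from the study.

```python
def severity_level(fatalities: int) -> str:
    """Map the number of fatally injured victims to the three-point severity scale."""
    if fatalities < 0:
        raise ValueError("fatalities must be non-negative")
    if fatalities == 0:
        return "non-fatal"          # zero fatalities
    if fatalities <= 2:
        return "severe"             # one or two fatalities
    return "extremely severe"       # three or more fatalities


if __name__ == "__main__":
    for count in (0, 1, 2, 3, 7):
        print(count, "->", severity_level(count))
```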
Analytical context of current study. The major focus of studies analyzing crash data in existing safety literature is identifying the responsible features of crash risk and crash severity outcomes. Our current study contributes towards the second stream of studies. Several studies examined both intrinsic and extrinsic factors related to crashes for crash severity analysis. These studies have identified a multitude of factors to be responsible for crash severity outcomes, including highway functional class, roadway geometry, demographics of road users, weather condition, and other situational attributes 15,17,23,25,27 . Several existing studies investigated the impacts of driver characteristics, vehicle characteristics, and pedestrian characteristics in understanding the crash severity mechanism 11,28 . A summary of these responsible factors of crash severity outcomes is illustrated in Fig. 2.
The prime objective of this study is to determine the importance of feature selection techniques in the crash severity prediction of Bangladesh by using the crash information collected from daily newspapers for the year 2019. Specifically, this study aims to apply machine learning-based algorithms to provide a detailed understanding of responsible factors for crash severity outcomes. Our preliminary observation from these studies is that the Decision Tree, Random Forest, and Naïve Bayes approaches are frequently applied machine learning algorithms in road safety research for their regularization parameter. These algorithms are important in avoiding over-fitting, better performance with imbalanced data, and non-linearity for input and target variables 22 . Among several feature selection techniques, chi-square, two-way ANOVA, and regression analysis have also been considered for their low complexity, effectiveness, and cost-sensitive properties. Therefore, in our crash severity analysis and prediction model, three machine learning techniques, namely Decision Tree, Random Forest, and Naïve Bayes, with three feature selection techniques (chi-square, Two-way ANOVA test, and Regression analysis) have been adopted for crash severity prediction. Generally, Decision Tree, Random Forest, and Naïve Bayes are traditional and popular algorithms in machine learning research. Delen et al. 13 reported that Decision Tree provides better results compared to other models (such as the Probit model) introduced by Mondal et al. 10 . Similarly, some previous studies have implemented a Random Forest algorithm for crash severity analysis and prediction. These studies concluded that Random Forest performed better than other machine learning algorithms on both small and large datasets in crash severity identification. 
Beyond these traditional algorithms, we have also adopted the Naive Bayes technique, including Multinomial Naïve Bayes and Gaussian Naïve Bayes techniques. Naïve Bayes techniques are likely to be an effective classifier with a large amount of data in the automatic learning process. It is easy to implement and helps to improve the efficiency of the final result. The effectiveness of the proposed approach has been evaluated through precision, F1 score, classification accuracy, sensitivity, and specificity to understand the crash severity outcome and responsible factors.

Crash severity analysis and responsible factors prediction
In this section, we present the crash severity prediction model to identify responsible factors that are related to crash severity outcomes. The workflow diagram of the proposed crash severity prediction model is illustrated in Fig. 3. The crash severity analysis and prediction were performed in four phases: data input (Phase-I), feature selection (Phase-II), machine learning classification (Phase-III), and performance evaluation (Phase-IV).
Feature analysis approach. This section considers feature selection techniques (Fig. 3, Phase-II) to evaluate the importance of each input feature in predicting the actual output (crash severity outcome). This study considered chi-square statistics, Two-way ANOVA statistics, and Regression analysis statistics to identify the responsible features for executing our proposed model. A brief explanation of the feature selection statistics is given below.
Chi-square. Chi-square is a statistical measure of the association between two attributes (dependent and independent). It is also known as the categorical feature evaluation technique, which provides the correlation information of two attributes to show the difference between two sets of data (observed and target dataset). Equation (1) shows the chi-square statistic calculation, where "c" is the degree of freedom, "O" refers to the observed value, and "E" refers to the expected value.
$$\chi^2_c = \sum_i \frac{(O_i - E_i)^2}{E_i} \quad (1)$$
To evaluate the chi-square statistic, we considered the chi-square critical value at the significance level α = 0.05 (5%) and found six features to be the most correlated: Day of Crash, Residential Location, Vehicle Type, License Type, Seat Belts, and Gender. Figure 4 shows the graphical representation of the most correlated and less correlated features identified in the crash dataset with their corresponding chi-square scores.
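As an illustration of the chi-square statistic described above, the following sketch computes it for a contingency table of one categorical feature against the severity classes. The counts in the example are hypothetical and are not taken from the crash dataset; the function name is illustrative, and the sketch assumes every row and column total is non-zero. The critical value 3.841 is the standard chi-square critical value for one degree of freedom at α = 0.05.

```python
from typing import List, Sequence, Tuple


def chi_square_statistic(observed: Sequence[Sequence[float]]) -> Tuple[float, int]:
    """Return (chi-square statistic, degrees of freedom) for a contingency table.

    Expected counts are E_ij = (row total i * column total j) / grand total.
    Assumes all row and column totals are non-zero.
    """
    row_totals: List[float] = [sum(row) for row in observed]
    col_totals: List[float] = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand_total
            statistic += (o - e) ** 2 / e

    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


if __name__ == "__main__":
    # Hypothetical 2x2 table: rows = feature categories, columns = severity classes.
    table = [[10, 20], [20, 10]]
    stat, dof = chi_square_statistic(table)
    critical_value = 3.841  # chi-square critical value for dof = 1, alpha = 0.05
    print(f"chi2 = {stat:.3f}, dof = {dof}, significant = {stat > critical_value}")
```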
Two-way ANOVA. We consider the Two-Way ANOVA Test statistic to identify the substantial impact in two groups of data (input and target). Two-way ANOVA helps determine whether the null hypothesis should be accepted or rejected.
To evaluate the null hypothesis, we consider p values. The p value was compared against a significance level (α): if the p value was greater than the significance level (p value > α), then there was no significant impact between the two groups of data. Similarly, if the p value was less than the significance level (p value < α), there was a significant impact between the two data groups. We set the significance level α = 0.05 and found seven significant features: Vehicle Type, Day of Crash, License Type, Road Surface Type, Road Classification, Seat Belts, and Time of Crash. Figure 5 shows the Two-way ANOVA test result of significant features with their p value from the fourteen input features.
Regression analysis. Regression analysis is a statistical method to evaluate the features and understand the impact of each input feature on the output feature. It helps to determine the features that we should consider and, in turn, to obtain the desired output or better model performance. Therefore, through regression analysis, our prime goal was to identify the most important features or factors of the data. To evaluate the importance of input features through Regression analysis, we consider the p value. With the threshold value set to α = 0.05, we found five features responsible for crashes: Gender, License Type, Number of Lanes, Seat Belts, and Vehicle Type. Figure 6 shows the responsible feature identification of the BD crash dataset through regression analysis.
Model specification. This section describes several machine learning classifiers or predictive models (Fig. 3, Phase-III) that are effective in handling crash severity models. In this study, three machine learning models were considered: Decision Tree, Random Forest, and Naïve Bayes (Multinomial and Gaussian Naïve Bayes). These models were implemented in Python in a Colaboratory environment using data analysis and machine learning libraries such as Pandas and scikit-learn. Because all the features in our crash dataset are categorical, encoding the categorical data into numerical data is crucial. Therefore, we implemented a popular and effective encoder, namely Label Encoder, through the Python Sklearn library. Generally, Label Encoder assigns a distinct numerical value to each category of a feature 12 . After data encoding, our selected machine learning classifiers were implemented for crash severity prediction. The machine learning classifiers used in this study are briefly described in the following.
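Before turning to the individual classifiers, the encoding step mentioned above can be illustrated with a dependency-free sketch. It mirrors the behaviour of a label encoder (each distinct category is mapped to an integer, with categories taken in sorted order) but it is not the library implementation; the function names and the sample vehicle types are hypothetical.

```python
from typing import Dict, Hashable, List, Sequence


def fit_label_encoding(values: Sequence[Hashable]) -> Dict[Hashable, int]:
    """Assign a distinct integer code to each category, in sorted order."""
    classes = sorted(set(values))
    return {label: code for code, label in enumerate(classes)}


def encode_column(values: Sequence[Hashable], mapping: Dict[Hashable, int]) -> List[int]:
    """Replace each categorical value by its integer code."""
    return [mapping[value] for value in values]


if __name__ == "__main__":
    # Hypothetical values for a categorical feature such as Vehicle Type.
    vehicle_type = ["bus", "truck", "bus", "motorcycle"]
    mapping = fit_label_encoding(vehicle_type)
    print(mapping)
    print(encode_column(vehicle_type, mapping))
```

Integer codes of this kind impose an arbitrary order on the categories. Tree-based classifiers are usually tolerant of this, which is one reason the encoding is commonly paired with Decision Tree and Random Forest models.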
Decision Tree (DT). Since the early 1930s, Decision Tree has been popular and effective for various decision-making problems using expert knowledge 18 . In recent days, it has become a popular and effective analytics tool in data mining over other machine learning techniques (i.e., neural network or support vector machine). Decision Tree is a supervised machine learning classifier that works by splitting categorical data to create a model that predicts the value of a target attribute. Therefore, in our model, we used Decision Tree to analyze our categorical features to predict several crash severity outcomes.
Random Forest (RF). Random Forest is a popular supervised machine learning classifier widely used in several classification and prediction applications. Random Forest classifiers work by constructing multiple decision trees from the given dataset during the learning phase and learning both the dataset's usual and unusual patterns during the training phase. It can provide high-accuracy predictions for large and small datasets, even when data are missing from the dataset. Previous research also added that Random Forest works much better than traditional decision trees. It combines the outputs of all the decision trees and calculates their average to reduce overfitting and improve prediction accuracy.
Naïve Bayes (NB). Naïve Bayes classifier is a widely used machine learning classifier that is easy to use for large-scale datasets and fast to predict in multi-class prediction. It is also known as a probabilistic classifier that performs based on Bayes' Theorem. Equation (2) shows the process of Naïve Bayes classifier prediction methods where "x" is the predictor and "c" is the prediction class or target class. Some specialized Naïve Bayes classifiers have recently been introduced for crash severity prediction. Therefore, we considered the two most advanced Naïve Bayes classifiers in this study, namely: Multinomial Naïve Bayes and Gaussian Naïve Bayes classifiers. Multinomial Naïve Bayes is a specialized version of Naïve Bayes classifier which automatically learns from the input features through an automated learning process. Also, it refers to the Multivariate Event Model, which is accurate for making predictions. Generally, it performs through multidimensional and polynomial probability theorems to overcome several data ambiguities. The multidimensional model was used to handle the missing attributes of the data and make a prediction. Besides, the polynomial model was used to record each attribute occurrence to identify the attributes that do not appear in the dataset. Gaussian Naïve Bayes classifier is another specialized version of Naïve Bayes classifier, which deals with multi-class data to predict according to Gaussian distribution parameters. It works by segmenting data according to each class and then computes each class's mean and variance to make predictions of input values (x) associated with observation values (v).
$$P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)} \quad (2)$$
Here, P(c|x) is the posterior probability of the target class, P(c) is the prior probability of the target class, P(x|c) is the probability of the predictor given the class, and P(x) is the prior probability of the predictor.
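A small numerical sketch may help to connect Bayes' theorem to a three-class severity prediction. Under the naïve independence assumption, the class-conditional probability of a record is the product of the per-feature conditional probabilities, and P(x) is obtained by summing the joint probabilities over the classes. All priors and likelihoods below are hypothetical values chosen for illustration; they are not estimates from the crash dataset, and the function name is illustrative.

```python
from math import prod
from typing import Dict, Sequence


def naive_bayes_posterior(
    priors: Dict[str, float],
    feature_likelihoods: Dict[str, Sequence[float]],
) -> Dict[str, float]:
    """Compute P(c|x) for each class c from P(c) and per-feature P(x_j|c)."""
    # Joint probability P(x|c) * P(c), with P(x|c) as a product over features.
    joint = {c: priors[c] * prod(feature_likelihoods[c]) for c in priors}
    evidence = sum(joint.values())  # P(x)
    return {c: joint[c] / evidence for c in joint}


if __name__ == "__main__":
    # Hypothetical priors and per-feature likelihoods for one crash record.
    priors = {"non-fatal": 0.2, "severe": 0.5, "extremely severe": 0.3}
    likelihoods = {
        "non-fatal": [0.5, 0.2],
        "severe": [0.8, 0.5],
        "extremely severe": [0.4, 0.5],
    }
    posterior = naive_bayes_posterior(priors, likelihoods)
    predicted = max(posterior, key=posterior.get)
    print(posterior, predicted)
```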

K-fold cross-validation. Estimating the actual accuracy of a prediction model induced by a supervised learning algorithm is important: it helps to estimate future prediction accuracy and to choose a better classifier. The most popular methodology for estimating the actual accuracy of a model is splitting the dataset into training and testing parts 13 . A conventional implementation of the dataset-splitting method is k-fold cross-validation, also called v-fold cross-validation. Generally, k-fold cross-validation is a process of splitting a complete dataset into k distinct subsets (folds). In each round of the experiment, k − 1 folds were used to train the desired classifier, and the remaining fold was used to test or validate the model. This process was repeated k times so that each fold was used once to test the classifier while the remaining data were used for training. Finally, the overall accuracy of the classifier was calculated as the mean of the k test accuracies. Equation (3) shows the k-fold accuracy calculation. In our study, we used tenfold cross-validation, that is, k = 10.
$$\text{Accuracy} = \frac{1}{k}\sum_{i=1}^{k} \text{Accuracy}(i) \quad (3)$$
Model evaluation. Several evaluation metrics were used for the performance measurement of machine learning models. To understand and evaluate the performance of our machine learning-based crash severity prediction model, we used a confusion matrix for both the testing and training output data. The confusion matrix compares the predicted classes with the actual classes of our test data, and it was used to calculate several other performance measures, namely classification accuracy, precision, sensitivity, specificity, and F1 score (Fig. 3, Phase-IV), which are defined by the equations referenced in the remainder of this section.
Furthermore, we consider other evaluation metrics such as precision, recall/sensitivity, specificity, and F1 score. In the equations below, TP, FP, FN, and TN denote the numbers of true positives, false positives, false negatives, and true negatives for the non-fatal class. Precision is the ratio of crashes correctly predicted as non-fatal, severe, or extremely severe to the total number of crashes predicted as non-fatal, severe, or extremely severe, respectively. The precision calculation of non-fatal injury of the crash severity prediction model is shown in Eq. (5).
$$\text{Precision}_{\text{non-fatal}} = \frac{TP}{TP + FP} \quad (5)$$
The recall is also known as sensitivity, which can be defined as the ratio of crashes correctly predicted as non-fatal, severe, or extremely severe to the total number of actual non-fatal, severe, or extremely severe crashes. Equation (6) represents the sensitivity calculation for non-fatal injury.
$$\text{Sensitivity}_{\text{non-fatal}} = \frac{TP}{TP + FN} \quad (6)$$
Several studies suggested that high values of precision and recall indicate the best prediction classifier 14 . However, achieving both high precision and high recall is difficult and sometimes impossible 15 . Therefore, to address this problem, the F1 score was introduced as a single measure that summarizes the model's performance more reliably and is widely accepted by researchers. Generally, the F1 score is the harmonic mean of precision and recall, as defined in Eq. (7).
$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \quad (7)$$
Furthermore, we also consider the specificity. Specificity measures how well the model identifies the crashes that do not belong to a given class (non-fatal, severe, or extremely severe). It is the ratio of crashes correctly predicted as not belonging to the class to the total number of crashes that actually do not belong to that class. The specificity calculation for non-fatal injury is defined through Eq. (8).
$$\text{Specificity}_{\text{non-fatal}} = \frac{TN}{TN + FP} \quad (8)$$
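The per-class measures described above can be derived from a multi-class confusion matrix in a one-versus-rest manner. The sketch below assumes rows are actual classes and columns are predicted classes; the 3 × 3 matrix is hypothetical and does not reproduce any result of this study, and the function names are illustrative.

```python
from typing import Dict, Sequence


def accuracy(cm: Sequence[Sequence[int]]) -> float:
    """Fraction of records on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def per_class_metrics(cm: Sequence[Sequence[int]], k: int) -> Dict[str, float]:
    """Precision, sensitivity, specificity, and F1 for class index k (one-vs-rest)."""
    total = sum(sum(row) for row in cm)
    tp = cm[k][k]
    fn = sum(cm[k]) - tp                   # actual k, predicted as another class
    fp = sum(row[k] for row in cm) - tp    # predicted k, actually another class
    tn = total - tp - fn - fp

    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


if __name__ == "__main__":
    labels = ["non-fatal", "severe", "extremely severe"]
    # Hypothetical confusion matrix: rows = actual class, columns = predicted class.
    cm = [[2, 1, 0],
          [1, 30, 2],
          [0, 3, 5]]
    print("accuracy:", round(accuracy(cm), 4))
    for index, label in enumerate(labels):
        print(label, per_class_metrics(cm, index))
```

Weighted averages such as those reported in the evaluation can be obtained by averaging the per-class values with weights proportional to the number of actual records in each class.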

Evaluation result
To evaluate the performance of three machine learning classifiers with different feature selection techniques in crash severity prediction, in this study, we performed the evaluation through two evaluation criteria: first, considering precision and F1 score, and second, through classification accuracy, sensitivity, and specificity. The experimental result with a detailed discussion is described in the following.

Evaluation-1.
To identify the best classifier for the crash severity prediction model, precision and F1 score are effective and recommended by previous researchers. Therefore, this study evaluated the results of our three prediction classifiers (Decision Tree, Random Forest, and Naïve Bayes) through precision and F1 score statistics. We conducted five tests for each classifier, using the three feature selection processes individually and in combination: chi-square, Two-way ANOVA test, Regression analysis, chi-square ∪ Two-way ANOVA ∪ Regression analysis, and chi-square ∩ Two-way ANOVA ∩ Regression analysis.
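The union and intersection feature sets used in these tests follow directly from the features reported as significant by each technique in the feature analysis section. The sketch below builds them with Python sets; the variable names are illustrative.

```python
# Features reported as significant by each feature selection technique.
CHI_SQUARE = {
    "Day of Crash", "Residential Location", "Vehicle Type",
    "License Type", "Seat Belts", "Gender",
}
ANOVA = {
    "Vehicle Type", "Day of Crash", "License Type", "Road Surface Type",
    "Road Classification", "Seat Belts", "Time of Crash",
}
REGRESSION = {
    "Gender", "License Type", "Number of Lanes", "Seat Belts", "Vehicle Type",
}

# The five feature sets evaluated for each classifier.
feature_sets = {
    "chi-square": CHI_SQUARE,
    "Two-way ANOVA": ANOVA,
    "Regression": REGRESSION,
    "union": CHI_SQUARE | ANOVA | REGRESSION,
    "intersection": CHI_SQUARE & ANOVA & REGRESSION,
}

if __name__ == "__main__":
    for name, features in feature_sets.items():
        print(f"{name} ({len(features)}): {sorted(features)}")
```

With the features listed in the text, the union contains ten features and the intersection contains three (Vehicle Type, License Type, and Seat Belts).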
Our prime goal in this study was to identify the most effective machine learning classifier for crash severity prediction. Therefore, we evaluated the prediction result through each classifier's weighted average precision and weighted average F1 score, both with and without feature selection. First, Fig. 7 shows the graphical representation of the experimental result based on the weighted average precision (y-axis) of the classifiers: Decision Tree, Random Forest, and Naïve Bayes (Multinomial and Gaussian) (x-axis). It shows that Random Forest gives the better result across the feature selection techniques, in particular reaching the highest value for {chi-square} and {chi-square ∪ Two-way ANOVA ∪ Regression} with a maximum weighted average precision of 1.0. Besides, the Decision Tree had the best result for the {chi-square ∩ Two-way ANOVA ∩ Regression} feature selection technique with a maximum weighted average precision of 0.77. In contrast, Multinomial Naïve Bayes and Gaussian Naïve Bayes were found effective without a feature selection process.
Secondly, we compared the prediction result of each classifier according to their weighted average F1 score (y-axis), as shown in Fig. 8. It depicts that, similar to the precision result, Random Forest classifier was found effective for {chi-square} and {chi-square ∪ Two-way ANOVA ∪ Regression} features with maximum weighted average precision. The Decision Tree had the best result for {chi-square ∩ Two-way ANOVA ∩ Regression} feature selection technique with a maximum of 0.77 F1 scores (1.0). Besides, the Decision Tree provided a satisfactory result for {chi-square ∩ Two-way ANOVA ∩ Regression} feature selection technique with a maximum weighted average F1 score (0.78). In contrast, Multinomial Naïve Bayes and Gaussian Naïve Bayes performed better without a feature selection process (0.94 and 1.0 Weighted Avg. F1 score, respectively).
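For readers who want to reproduce the two measures used in this evaluation, the following is a minimal sketch of support-weighted average precision and F1 score computed from a multi-class confusion matrix. It assumes rows are true classes and columns are predicted classes; the function name and the example matrix are hypothetical and do not come from the study's data.

```python
def weighted_precision_f1(cm):
    """Return (weighted precision, weighted F1) for a square confusion matrix.

    cm[i][j] is the number of samples of true class i predicted as class j.
    Each class score is weighted by its support (number of true samples).
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    w_precision = 0.0
    w_f1 = 0.0
    for i in range(n):
        tp = cm[i][i]
        support = sum(cm[i])                       # all true samples of class i
        predicted = sum(cm[r][i] for r in range(n))  # all samples predicted as i
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        w_precision += precision * support / total
        w_f1 += f1 * support / total
    return w_precision, w_f1


# Hypothetical matrix; rows/columns: non-fatal, severe, extremely severe.
example_cm = [
    [8, 2, 0],
    [1, 60, 4],
    [0, 5, 20],
]
precision, f1 = weighted_precision_f1(example_cm)
print(f"weighted precision = {precision:.3f}, weighted F1 = {f1:.3f}")
```

Weighting by support means the most frequent severity class dominates both scores, which is worth keeping in mind when the severity classes are imbalanced.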

Evaluation-2.
After evaluating the classifiers' results through weighted average precision and F1 score (evaluation-1), we compared the classifiers' results through classification accuracy, sensitivity, and specificity. Table 3 presents the classification accuracy, sensitivity, and specificity of the three classifiers (Decision Tree, Random Forest, and Naïve Bayes) with and without a feature selection process. It shows that for chi-square, Two-way ANOVA, and {chi-square

Result and discussion
Evaluation 1 concluded that precision and F1 score are equally important for evaluating any crash severity prediction model. Here, we considered both the precision and F1 score of each classifier, with and without feature selection. Figure 9 shows that the Random Forest classifier performs better with feature selection than the other classifiers. In addition, the Decision Tree achieves a high precision and F1 score for the {chi-square ∩ Two-way ANOVA ∩ Regression} feature set. Furthermore, Gaussian Naïve Bayes performs better than the other classifiers when no feature selection is applied. Thus, according to precision and F1 score, Random Forest was the best classifier for this crash severity prediction model.
According to the second evaluation, classification accuracy, sensitivity, and specificity were equally important for evaluating crash severity prediction models. Therefore, in this study, we required high classification accuracy, sensitivity, and specificity together when judging a classifier's effectiveness. Figure 10 shows that the Random Forest classifier performed better than the other models for three feature selection processes: chi-square, ANOVA, and {chi-square ∪ Two-way ANOVA ∪ Regression}. The Decision Tree classifier performed better for one feature selection process. In contrast, Multinomial Naïve Bayes and Gaussian Naïve Bayes reached their maximum classification accuracy, sensitivity, and specificity without a feature selection process. Random Forest performed best for the majority of the feature selection processes. For chi-square and ANOVA, the results showed that Random Forest performed comparatively better than the other models. However, for regression, the analysis was not conclusive in identifying the best classifier in the current study context.
The importance of the feature selection process was assessed by comparing classification accuracy, sensitivity, and specificity with and without feature selection. The comparison shows that the Random Forest classifier performed better with feature selection than without it. The Decision Tree performed better with the {chi-square ∩ Two-way ANOVA ∩ Regression} feature set than without feature selection. In contrast, for the Naïve Bayes classifier, none of the feature selection processes outperformed the model without feature selection. We therefore conclude that feature selection is vital for Random Forest and Decision Tree but not for the Naïve Bayes classifier.
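The three measures used in the second evaluation can be sketched as follows for a three-level severity problem. This is an illustrative sketch only: it assumes a one-vs-rest treatment of each severity class, which the text does not specify, and the function name and example matrix are hypothetical.

```python
def accuracy_sensitivity_specificity(cm):
    """Return (accuracy, [(sensitivity, specificity) per class]).

    cm[i][j] is the number of samples of true class i predicted as class j.
    Sensitivity and specificity are computed one-vs-rest for each class
    (an assumption; other aggregation schemes are possible).
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n)) / total
    per_class = []
    for i in range(n):
        tp = cm[i][i]
        fn = sum(cm[i]) - tp                       # class i missed
        fp = sum(cm[r][i] for r in range(n)) - tp  # other classes labeled i
        tn = total - tp - fn - fp
        sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
        specificity = tn / (tn + fp) if (tn + fp) else 0.0
        per_class.append((sensitivity, specificity))
    return accuracy, per_class


# Hypothetical matrix; rows/columns: non-fatal, severe, extremely severe.
example_cm = [
    [8, 2, 0],
    [1, 60, 4],
    [0, 5, 20],
]
accuracy, per_class = accuracy_sensitivity_specificity(example_cm)
labels = ["non-fatal", "severe", "extremely severe"]
print(f"accuracy = {accuracy:.3f}")
for label, (sens, spec) in zip(labels, per_class):
    print(f"{label}: sensitivity = {sens:.3f}, specificity = {spec:.3f}")
```

Requiring all three measures to be high together, as done above, guards against a classifier that scores well on accuracy simply by favoring the most common severity level.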

List of crash factors
Moreover, the evaluation results showed that feature selection was effective and helped to improve the classification and prediction results 13,15,16,21,27 . Among the feature selection techniques, chi-square and Two-way ANOVA were found to be effective in improving the accuracy of our crash severity and risk factor identification model. Therefore, the features identified by these two techniques were considered the factors most responsible for influencing crashes. The nine identified risk factors are: Day of crash, Residential location, Vehicle type, License type, Seat belts, Gender, Time of crash, Road surface type, and Road classification. The Standard Deviation (SD) statistic was used to evaluate the identified factors. The detailed SD results are shown in Table A.1 of the Appendix section. The results specific to these features are discussed in the following sections.
Day of crash. Day of crash was found to be one of the most significant factors of crash severity outcome. The SD results suggest that among the 12 months, February (SD-1.68), June (SD-1.631), and October (SD-1.625) experienced a significant number of severe injury crashes, whereas February (SD-0.653) experienced more extremely severe injury crashes than the rest of the months. In contrast, March (SD-0.137) experienced more non-fatal injury crashes than other months of the year. We therefore conclude that severe injury crashes occur more frequently than extremely severe and non-fatal injury crashes throughout the year. To reduce injuries, traffic safety authorities should monitor vehicles on the road more carefully during these months.

Residential location.
Across the crash data from 47 places, Dhaka had a higher probability of severe crashes (SD-5.916) than any other place in the country. Also, Bagerhat (SD-1.50), Chattogram (SD-1.50), and Faridpur (SD-1.74) experienced a significant number of extremely severe crashes, which would increase the probability of future extremely severe crashes. Gazipur had fewer severe and extremely severe crashes but experienced more non-fatal crashes than other places. Traffic authorities should increase their monitoring in these areas to control road injuries.
Vehicle type. Based on the findings, Auto Rickshaw (SD-6.075), Footpath (SD-9.099), Motorcycle (SD-9.697), Mini-Truck (SD-2.993), and Truck (SD-4.191) were found responsible for increasing the probability of severe crashes and reducing the likelihood of non-fatal or extremely severe crashes. In contrast, the most severe crashes occurred for Bus (SD-5.046) and Microbus (SD-2.554). In most cases, old buses and microbuses were associated with these crashes. Therefore, traffic authorities should focus on the fitness of vehicles to reduce accidents. Tractors, cars, and bicycles were involved in fewer medium and mild injury crashes.
License type. License type plays a vital role in crash severity analysis. In this study, license type is categorized into P, professional (drivers who hold a license), and NP, nonprofessional (drivers who have no license). The statistics showed that severe crashes occur more frequently for nonprofessional drivers (SD-12.32) than for professional drivers (SD-9.599). For professional drivers, however, non-fatal and extremely severe crashes were higher than for nonprofessional drivers, with SD-0.533 for non-fatal and SD-4.799 for extremely severe crashes. This may be due to not abiding by traffic rules or speed limits, or to overconfident attitudes. Traffic authorities should take steps to increase awareness about traffic rules and their importance.
Seat belts. Seat belt use is one of the important factors in road crashes. The analysis showed that not wearing seat belts was more likely to lead to severe, non-fatal, and extremely severe crashes, with SD 15.401, 0.393, and 4.477, respectively (Appendix Table A.1). Government and transportation safety policymakers should encourage people to wear seat belts to minimize unexpected crash consequences.
Gender. Male drivers were more likely to be involved in severe and extremely severe crashes with SD- 14
Time of crash. To identify the time of crashes, we divided our crash data into daytime and nighttime. The statistics showed that for severe (SD-13.151) and extremely severe (SD-4.402) crashes, the probability of crash occurrence in the daytime was higher than at night. Generally, traffic volume was higher in daylight than at night. As a result, people who are running out of time tend to drive at high speed or disregard traffic rules, which might increase crashes. Traffic authorities should improve their monitoring at night to control crashes on the road.

Road surface type.
We focused on five types of road surfaces to analyze the crash data: Diamond Interchange, Zig Zag Road, J Turns, Circular Road, and Normal Road. We found that severe and extremely severe crashes were more likely to occur on Normal Road surfaces than on other road surfaces, with SD 11.03 and 3.497, respectively. Also, the Circular Road surface was responsible for a large number of non-fatal crashes, with SD 0.328.
Road classification. We found that crashes were more likely to be severe on National Highways (NH), with SD 8.895. Compared to other road classes, extremely severe crashes were more likely to happen on Regional Highways (RH), with SD 3.054. In contrast, Union Roads were more likely to be responsible for non-fatal crashes.

Implications
Nowadays, crash severity prediction is a global issue for governments and road safety authorities. In this study, machine learning techniques combined with a feature selection process predict crash severity efficiently and help to identify responsible factors associated with crashes. Predicted crash severity levels may be considered crucial information for assessing crash risk factors. Using this model, the road safety authority can identify the factors responsible for non-fatal, severe, and extremely severe crashes, which are sometimes difficult for road safety professionals and other policymakers to discern. Therefore, this study might be helpful for accident prevention. However, it is noteworthy that the prediction result is not absolute and may vary in different situations. Therefore, safety professionals should carefully monitor the responsible factors identified in this study. Moreover, this study would assist safety professionals in understanding unrevealed information and predicting associated attributes that influence crash severity.

Conclusion and future work
In this paper, we present a machine learning-based crash severity prediction model from the perspective of road crashes in Bangladesh. This model analyzes 441 driver crashes in different parts of Bangladesh. The study went through data collection, data preprocessing, crash data coding, implementation of three machine learning models, comprehensive data analysis, and identification of contributing factors to crash severity. After implementation and analysis, we have the following major findings:
1. Feature selection plays a prominent role alongside machine learning models in identifying the responsible factors and improving model performance. Chi-square and Two-way ANOVA have been found significant in examining the crash data.
2. Machine learning models are beneficial for uncovering new insights into heterogeneous crash datasets. Random Forest and Decision Tree seem to be effective options for predicting crash severity with better prediction performance.
3. The features identified as responsible factors in our crash severity prediction model are the 'day of crash', 'residential location', 'vehicle type', 'license type', 'seat belts', 'gender', 'time of crash', 'road surface type', and 'road classification'.
4. According to the research findings (crash factors), it could be argued that people might not have adequate knowledge of safe driving and perhaps are unwilling to follow traffic rules, for example by refusing to wear seat belts, driving without a license, using unfit vehicles, and lacking knowledge about traffic rules associated with road types (National Highway (NH) and Regional Highway (RH)), road surface types (Diamond Interchange, Zig Zag Road, J Turns, Circular Road, and Normal Road), area type (commercial and residential area), and speed limits during daytime and nighttime.
Police and government authorities should spread awareness about traffic rules and safety issues, enforce traffic rules more strictly, and monitor vehicle movement and vehicle fitness on a regular basis. This might help to reduce the upward trend of crashes in Bangladesh.
Future work will consider other machine learning and neural network models. For instance, we want to work with different types of machine learning algorithms (Support Vector Machine 12 , Logistic Regression 21 , and K-Nearest Neighbor 18 ) and neural network models (Artificial Neural Network 15 , Feedforward Neural Network 6 , and Multilayer Perceptron 10 ). We intend to increase the dimension of the dataset by considering other factors that are likely to be responsible for crashes. Therefore, we will collect data from different sources (accident reports and police feedback) and focus on data imbalance handling 18 , which might improve the accuracy of our future studies. Additionally, the compiled data will be validated against other official open-source data, which is one of the prime concerns of future work.

Data availability
All data generated or analyzed during this study are included in this published article. The data are also available in BD_Road_Crash_Data.